Indexing Finite Language Representation of Population 

Genotypes 



Jouni Siren*, Niko Valimaki **, and Veli Makinen 

Helsinki Institute for Information Technology (HUT) & 
Department of Computer Science, University of Helsinki, Finland 
{ j ltsiren , rival imak , vmakinen}@cs . helsinki . f i 



Abstract. With the recent advances in DNA sequencing, it is now possible to have complete genomes 
of individuals sequenced and assembled. This rich and focused genotype information can be used to do 
different population-wide studies, now first time directly on whole genome level. We propose a way to 
index population genotype information together with the complete genome sequence, so that one can use 
the index to efficiently align a given sequence to the genome with all plausible genotype recombinations 
taken into account. This is achieved through converting a multiple alignment of individual genomes 
into a finite automaton recognizing all strings that can be read from the alignment by switching the 
sequence at any time. The finite automaton is indexed with an extension of Burrows- Wheeler transform 
to allow pattern search inside the plausible recombinant sequences. The size of the index stays limited, 
because of the high similarity of individual genomes. The index finds applications in variation calling 
and in primer design. On a variation calling experiment, we found about 1.0% of matches to novel 
recombinants just with exact matching, and up to 2.4% with approximate matching. 

1 Introduction 

Due to the advances in DNA sequencing [19], it is now possible to have complete genomes of 
individuals sequenced and assembled. Already several human genomes have been sequenced [24, 10, 
12, 25, 16] and it is almost a routine task to resequence individuals by aligning the high-throughput 
short DNA reads to the reference [7]. This rich and focused genotype information, together with 
the more global genotype information (common single nucleotide polymorphisms (SNPs) and other 
variations) created using earlier techniques, can be combined to do different population-wide studies, 
now first time directly on whole genome level. 

We propose a novel index structure that simultaneously represents and extrapolates the geno- 
type information present in the population samples. The index structure is built on a given multiple 
alignment of individual genomes, or alternatively for a single reference sequence and set of SNPs 
of interest. The index structure is capable of aligning a given pattern to any path taken along the 
multiple alignment, as illustrated in Figure l. 1 

To build the index, we first create a finite automaton recognizing all paths through the multiple 
alignment, and then generalize Burrows- Wheeler transform (BWT) [2] -based self-index structures 
[21] to index paths in labeled graphs. The backward search routine of BWT-based indexes gener- 
alizes to support exact pattern search over the labeled graph in 0(m) time, for pattern of length 
m. On general labeled graphs, such index can take exponential space, but on graphs resulting from 
finite automaton representation of multiple alignment of individual genomes, the space is expected 
to stay limited. 

Applications for our index include the following: 

* Funded by the Finnish Doctoral Programme in Computational Sciences, Academy of Finland project 1140727 and 
the Nokia Foundation. 

** Funded by Helsinki Graduate School in Computer Science and Engineering and Finnish Centre of Excellence for 
Algorithmic Data Analysis Research 
1 It is also possible to limit to only those paths that contain recombination hotspots [20], to avoid matching to too 
rare recombinants. For brevity, we cover here only the unrestricted case. 
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Fig. 1. Pattern AGCTGTGT matching the multiple alignment when allowing it to change row when necessary. 



— SNP calling. We can take the known SNPs into account already in the short read alignment 
phase, instead of the common pipeline of alignment, variation calling, and filtering of known 
SNPs. This allows more accurate alignment, as the known SNPs are no longer counted as errors, 
and the matches can represent novel recombinants not yet represented in the database. Hence, 
we can expect to output less false positive candidates for the postprocessing steps such as for the 
common realignment step that creates a consensus for indels by multiple alignment of nearby 
mapped reads. 

— Probe/primer design. When designing probes for microarrays or primers for PCR, it is important 
that the designed sequence does not occur even approximately elsewhere than in the target. Our 
index can provide approximate search not only against all substrings, but also against plausible 
recombinants, and hence the design can be made more selective. 

— Large indel calling. After short read alignment, the common approach for detecting larger indels 
is to study the uncovered regions in the reference genome. If there is a deletion in the donor, 
there should be an almost equal size gap in the mapping result. If there is an insertion in the 
donor, there should be a pair of positions in the reference covered by no single read (in the 
perfect world). We can model deletions with our index by adding an edge to the automaton 
skipping each plausible deleted area. For insertions, we can apply de novo sequence assembly 
with the unmapped reads (and pair of fixed start and end sequences representing the prefix 
and suffix of the predicted insert position) to generate plausible insertions, and add these as 
paths to the automaton, (or even multiple paths read from an overlap graph, without fixing the 
assembly) . Realigning all the reads with the index built after adding the plausible deletions and 
insertions to the automaton then gives a voting result to call for the large variants. 

— Reassembly of donor genome. Continuing from variation calling, one can use the realignment of 
the reads to the refined automaton to give a probability for each edge. It is then easy to extract, 
for example, the most probable path through the automaton as a consensus for the donor. 

We made some feasibility experiments on SNP calling problem. We created our index on a 
multiple alignment of four instances of the 76 Mbp human chromosome 18. The total size of the 
index was about 67 MB. Aligning a set of 10 million Illumina Solexa reads of length 56 took 
18 minutes, and about 1.0% of exact matches were novel recombinants not found when indexing 
each chromosome instance separately (see Sect. 6). Leaving these exact matches out from variation 
calling reduces the number of novel SNPs from 4203 to 1074. With approximate matching, the 
proportion of novel recombinants increased to 2.4%. 

Related work and extensions. Jumping alignments of protein sequences were studied in [23] as an 
alternative to profile Hidden Markov Models. They showed how to do local alignment across a 
multiple alignment of protein family so that jumping from one sequence to another is associated 
with a penalty. This gives an alternative to the Markov model approaches to decide whether a given 
protein is part of the protein family. The computation requires dynamic programming through the 
multiple alignment. We study the same problem but from a different angle; we provide a compressed 
representation of the multiple alignment with an efficient way to support pattern matching. 
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Calling of large indels similar to our approach was studied in [1]. One difference is that they 
manage to associate probabilities to the putative genotypes, resulting into a more reliable calling 
of likely variants. However, their indexing part is tailored for this specific problem, whereas we 
develop a more systematic approach that can be generalized to many directions. For example, we 
can take the probabilities into account and index only paths with high enough probabilities, to 
closely simulate their approach (see Sect. 7). 

Our work builds on the self-indexing scenario [21], and more specifically is an extension of the 
XBW transform [5] that is an index structure for labeled trees. Our extension to labeled graphs 
may be of independent interest, as it has potentially many more applications inside and outside 
computational biology. 

The focus of this paper is the finite automaton representation of a multiple alignment. This 
setting is closely related to our previous work on indexing highly repetitive sequence collections 
[17]. In our previous work, we represented a collection of individual genomes of total length N, with 
reference sequence of length n, and a total of s mutations, in space 0(nlog ^ + slog 2 N) bits in 
the average case (rough upper bounds here for simplicity). Exact pattern matching was supported 
in 0(m log N) time. The index proposed in this paper achieves 0(n(l + s/n) ^™)) bits in the 
expected case for constant-sized alphabets. 

The paper is organized as follows. Section 2 introduces the notation, Sect. 3 reviews the neces- 
sary index structures we build on, Sect. 4 describes the new extension to finite languages, Sect. 5 
shows how to construct the new index given a multiple alignment, Sect. 6 gives some preliminary 
experiments on SNP calling problem, and Sect. 7 concludes the work by sketching the steps required 
for making the index into a fully applicable tool. 

2 Definitions 

A string S = S[l,n] is a sequence of characters from alphabet E = {1,2, . . . , a}. A substring of S 
is written as S[i,j]. A substring of type <S[l,j] is called a prefix, while a substring of type 5[i,n] is 
called a suffix. A text string T = T[l, n] is a string terminated by T[n] = $ G" E with lexicographic 
value 0. The lexicographic order "<" among strings is defined in the usual way. 

A graph G = {V, E) consists of a set V = {v\, . . . , v\y\} of nodes and a set E C V 2 of edges such 
that (v, v) G" E for all v G V. We call (u, v ) G E an edge from node u to node v. A graph is directed, 
if edge (u,v) is distinct from edge (v,u). In this paper, the graphs are always directed. For every 
node v £ V, the indegree in(v) is the number of incoming edges (u, v), and the outdegree out(v) the 
number of outgoing edges (v,w). 

Graph G is said to be labeled, if we attach a label i(v) to each node v G V. A path P = u\ ■ ■ ■ u\p\ 
is a sequence of nodes from u\ to u\ P \ such that (ui,Ui + \) G E for all i < \P\. The label of path P 
is the string £{P) = £(ui) • • • l(u\p\). A cycle is a path from a node to itself containing, at least one 
other node. If a graph contains no cycles, it is called acyclic. 

A finite automaton is a directed labeled graph A = (V, E). 2 The initial node v\ is labeled with 
£(yi) = # with lexicographic value a + 1, while the final node v\y\ is labeled with £(wyi) = $. The 
rest of the nodes are labeled with characters from alphabet U. Every node is assumed to be on 
some path from v\ to v\y\. 

The language L{A) recognized by automaton A is the set of the labels of all paths from v± to 
v\y\. We say that automaton A recognizes a string S G L(A), and that a suffix S' can be recognized 
from node v, if there is a path from v to V\y\ with label S'. Note that all strings in the language 
are of form where x is a string from alphabet E. If the language contains a finite number of 

2 Unlike the usual definition, we label nodes instead of edges. 
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strings, it is called finite. A language is finite if and only if the automaton is acyclic. Two automata 
are said to be equivalent, if they recognize the same language. 

Automaton A is forward (reverse) deterministic if, for every node v G V and every character 
c G U U {#,$}, there exists at most one node u such that i{u) = c and (v,u) G E {(u,v) G -E). 
For any language recognized by some finite automaton, we can always construct an equivalent 
automaton that is forward (reverse) deterministic. 

3 Compressed indexes 

The suffix array (SA) of text T[l,n] is an array of pointers SA[l,n] to the suffixes of T in lexi- 
cographic order. As an abstract data type, a suffix array is any data structure that supports the 
following operations efficiently: (a) find the SA range containing the suffixes prefixed by pattern P; 
(b) locate the suffix SA[i] in the text; and (c) display any substring of text T. 

Compressed suffix arrays (CSA) [8,6] support these operations. Their compression is based 
on the Burrows- Wheeler transform (BWT) [2], a permutation of the text related to the SA. The 
BWT of text T is a sequence BWT[l,n] such that BWT[i] = T[SA[z] - 1], if SA[i] > 1, and 
BWT[i] = T[n} = % otherwise. 

BWT can be reversed by a permutation called LF -mapping [2,6]. Let C[l,<r] be an array 
such that C[c] is the number of characters in {$, 1, 2, . . . , c — 1} occurring in the BWT. We also 
define C[0] = C[$] = and C[a + 1] = n. We define LF-mapping as LF(i) = C[BWT[i]] + 
ran/cBWT[j] (BWT, i), where rank c ( BWT, i) is the number of occurrences of character c in prefix 
BWT[l,t]. 

The inverse of LF-mapping is = select c (BV\IT , i — C[c]), where c is the highest value with 
C[c] < i, and se£ect c (BWT, j) is the position of the jth occurrence of character c in BWT [8]. By 
its definition, function IF is strictly increasing in the range [C[c] + 1, C[c + 1]] for every c G U. Note 
that T[SA[i]] = c and BWT[#(i)] = c for C[c] <i<C[c+l]. 

These functions form the backbone of CSAs. As SA[LF(i)] = SA[i] - 1 [6] and hence SA[<F(z)] = 
SA[i] + 1, we can use these functions to move the suffix array position backward and forward in 
the sequence. The functions can be efficiently implemented by adding some extra information to a 
compressed representation of the BWT. Standard techniques [21] for supporting SA functionality 
include using backward searching [6] for find, and sampling some suffix array values for locate and 
display. 

XBW [5] is a generalization of the Burrows- Wheeler transform for labeled trees, where leaf 
nodes and internal nodes are labeled with different alphabets. Internal nodes of the tree are sorted 
lexicographically according to path labels from the node to the root. Sequence BWT is formed by 
concatenating the labels of the children of each internal node in lexicographic order according to 
the parent node. Every internal node v now corresponds to a substring BWT[sp„, ep v ] containing 
the labels of its children. The first position sp v of each such substring is marked with a 1-bit in bit 
vector F. Backward searching is used to support the analogue of find. Tree navigation is possible 
by using BWT and F. 

4 Burrows- Wheeler transform for finite languages 

In this section, we generalize the XBW approach to finite automata. We call it the generalized 
compressed suffix array (GCSA). For the GCSA to function, we require that the automaton is 
prefix-sorted. Refer to Section 5 on how to transform an automaton into an equivalent prefix-sorted 
automaton. 
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Definition 1. Let A be a finite automaton, and let v £ V be a node. Node v is prefix-sorted by 
prefix p{v), if the labels of all paths from v to v\y\ share a common prefix p(v), and no path from 
any other node u / v to v\y\ has p(v) as a prefix of its label. Automaton A is prefix-sorted, if all 
nodes are prefix-sorted. 

Every node of a prefix-sorted automaton A corresponds to a lexicographic range of suffixes of 
language L(A). These ranges do not overlap for any two nodes. 

In XBW, bit vector F is used to mark both nodes and edges. If node v has lexicographic rank 
i, the labels of its predecessors are B\NT[sp v , ep v ] = B\NT[selecti(F, i), selecti(F, i + 1) — 1]. On the 
other hand, if node u is & child of node v, and BWT[j] contains the label of node u, then LF(j) 
is the lexicographic rank of the label of the path from node u through node v to the root. Hence 
selecti(F, LF(j)) gives us the position of edge (u,v). 

While the latter functionality is trivial in trees, a node can have many outgoing edges in a finite 
automaton. Hence we will use another bit vector M to mark the outgoing edges. 

Let A = (V, E) be a prefix-sorted automaton. To build GCSA, we sort the nodes v G V according 
to prefixes p(v). For every node v G V, sequence BWT and bit vectors F and M contain range 
[sp v ,ep v ] of length n(v) = max(m(i;), out(v)). See Figure 4 and Table 1 for an example. 

— B\NT[sp v , ep v ] contains the labels l(u) for all incoming edges (u, v) G E, followed by n(v) — in{v) 
empty characters. 

— F[sp v ] = 1 and F[i] = for sp v < i < ep v . 

— M[sp v , ep v ] contains out{v) 1-bits followed by n{v) — out(v) 0-bits. For the final node v\y\, the 
range contains one 1-bit followed by 0-bits. 

Array C is used with some modifications. We define C[o~ + 1] = C[#] in the same way as for 
regular characters, while C[o~ + 2] is set to be \E\. Assuming that each edge (u,v) G E has an 
implicit label £(u)p(v), we can interpret C[c] as the number of edges with labels smaller than c. We 
write char{i) to denote character c such that C[c] < i < C[c + 1]. 

4.1 Basic navigation 

Let [sp v , ep v ] be the range of BWT corresponding to node v G V. We define the following functions: 

— LF([sp v , ep v ], c) = [sp u , ep u ], where i(u) = c and (u, v) G E, or if no such u exists. 

— #{[sp u ,ep u \) = {[sp v ,ep v ] | (u,v) G E}. 

— £{[sp v ,ep v ]) = £(v). 

These are generalizations of the respective functions on BWT. LF can be used to move backwards 
on edge (u, v) such that l(u) = c, while \P lists the endpoints of all outgoing edges from node u. 
These functions can be implemented by using BWT, F, M, and C, as seen in Figure 2. 

Line 1 of LF is similar to the regular LF, determining the rank of edge label cp(v) among all 
edge labels. On lines 2 and 3, we determine if there is an occurrence of c in B\NT[sp v , ep v \. On 
line 4, we find the position of edge (it, v) in bit vector M, and on lines 5 and 6, we find the range 
[sp u , ep u ] containing this position. 

In function we determine the ranks of the outgoing edges from node u on lines 11 and 12. 
Line 14 is similar to the regular 'F, determining where the label of node u corresponding to the 
edge of rank i occurs in BWT. Lines 15 and 16 are similar to lines 5 and 6 in LF, converting the 
position j to the range [sp v , ep v ] corresponding to the destination node v. 
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function LF([sp v , ep v ], c): 

1 i 4- C[c] + rank c (B\NT, ep v ) 

2 if select C (BW 'T ', i - C[c]) < sp v : 

3 return 

4 i «- selecti(M,i) 

5 sp u 4— select\{F,ranki(F,i)) 

6 ep„ <s— selecti(F, ranki(F, i) + 1) 

7 return [sp u ,ep u ] 



1 14 j «- se/ect c (BWT, i - C[c]) 

15 sp v selecti(F,ranki(F, j)) 

16 ep„ selecti(F,ranki(F,j) + 1) — 1 

17 res res U {[sp„, ep„]} 

18 return res 



function ^([sp^, ep„]): 

9 c i- £([sp u ,ep u ]) 

10 res «- 

11 low rank\{M, sp u ) 

12 ftip/i <— ranki(M, ep u ) 

13 for i <s— low to /ug/i: 



function £([sp„, ep v ]): 

8 return char(ranki(M, sp v )) 



Fig. 2. Pseudocode for the basic navigation functions LF, \P, and i. 



4.2 Searching 

As the generalized compressed suffix array is a CSA, most of the algorithms using a CSA can be 
modified to use GCSA instead. In this section, we describe how to support the basic SA operations 
(see Sect. 3): 

— find(P) returns the range [sp, ep] of BWT corresponding to those nodes v, where at least one 
path starting from v has pattern P as a prefix of its label. 

— locate( [sp v , ep v ] ) returns a numerical value corresponding to node v. 

— display([sp u , ep u ], k) returns a prefix of the label of the path starting from node u. Stops when 
the prefix has length k or when there are multiple or no outgoing edges from the current node. 

We use backward searching [6] to support find. The algorithm maintains an invariant that 
[spi,epi] is the range returned by find(P[i,m]). In the initial step, we start with the edge range 
[C[P[m]] + l,C[P[m] + 1]], and convert it to range [sp m ,ep m ] by using bit vectors M and F. The 
step from [spi+i, epi+i] to [spi, epi] is a generalization of function LF for a range of nodes. We find 
the first and last occurrences of character P[i] in BWT[spj+i, e£>j+i], map them to edge ranks by 
using C and BWT, and convert the ranks to spi and epi by using F and M. 

For locate, we assume that there is a (not necessarily unique) numerical value id(v) attached 
to each node v G V. Examples of these values include node ids (so that id(vi) = i) and positions 
in the multiple alignment. To avoid excessive sampling of node values, id{v) should be id(u) + 1 
whenever (n, v) is the only outgoing edge from u and the only incoming edge to v. 

We sample id(u), if there are multiple or no outgoing edges from node u, or if id(v) / id(u) + 1 
for the only outgoing edge (u, v). We also sample one out of d node values, given sample rate d > 0, 
on paths of at least d nodes without any samples. The sampled values are stored in the same order 
as the nodes, and their positions are marked in bit vector B (B[sp u ] = 1, if node u is sampled). 

Node values are retrieved in a similar way as in CSAs [21]. To retrieve id(u), we first check 
if B[sp u ] = 1, and return sample rank±(B, sp u ), if this is the case. Otherwise we follow the only 
outgoing edge (u, v) by using function 1^, and continue from node v. When we find a sampled node 
w, we return id(w) — k, where k is the number of steps taken by using <F. 

Display is implemented in a straightforward way. We first output £([sp u , ep u ]), and then use 
\P([sp u , ep u ]) to get the outgoing edges. If there are multiple or no edges, we stop. Otherwise (u,v) 
is the only edge, and we continue from node v until k characters have been output. 

A better analogue to the display of suffix arrays would be display(id(u) , k) , extracting prefixes 
of path labels from all nodes v G V with node value id(u). To implement such operation, we need 
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am efficient way to map node values to BWT ranges. While some node value schemes allow such 
mapping easily, there is no obvious way to do so in other schemes. 

4.3 Analysis 

For each node v £ V, the length of range [sp v , ep v ] is the maximum of in(v) and out(v). As every 
node must have at least one incoming edge and one outgoing edge (except for the initial and the 
final nodes), the length of BWT is at most 2\E\ - \V\ + 2. 

Similar size bounds as for different variants of the CSA can be defined for compressed repre- 
sentations of BWT, if we first define a generalization of empirical entropy. Bit vectors F and M 
have \ V\ and \E\ 1-bits out of |BWT|, respectively. The number and the size of the samples depend 
greatly on the node value scheme used. 

Theorem 1. Assume that rank and select on bit vectors require 0{ts) time. GCSA with sample 
rate d supports Rnd(P) in 0(\P\ • ts) and locate ([sp v , ep v ]) in 0(d • ts) time. 

Proof. We use bit vectors \P C that mark the occurrences of character c G EU {#} to encode BWT. 
This reduces rank and select on BWT to the same operations on bit vectors. Basic operations I 
and LF take O(ie) time, as they require a constant number of bit vector operations. <F also takes 
0{ts) time, if the current node has outdegree 1. As find does one generalized LF per character of 
pattern, it takes 0(|P| • ts) time. 

Operation locate checks from bit vector B if the current position is sampled, and follows the 
unique outgoing edge using \P if not. This requires a constant number of bit vector operations per 
step. As a sample is found within d—1 steps, the time complexity is 0(d ■ □ 

4.4 Languages recognized by a prefix-sorted automaton 

GCSA can be extended from finite languages to some infinite languages as well. To define the class 
of languages that can be indexed with our approach, we relax the requirement that the automaton 
should be prefix-sorted. 

Definition 2. Let A = (V, E) be a finite automaton, and let v G V be a node. Let rng(v) be the 
smallest (open, semiopen, or closed) lexicographic range containing all suffixes that can be recognized 
from node v. Node v is prefix-range-sorted, if no suffix S G rng(v) is recognized from any other 
node v' / v. Automaton A is prefix-range-sorted, if all nodes are prefix-range-sorted. 

The definition states that the ranges of suffixes recognized from two nodes must not overlap. 
When this is true, the incoming edge encoded by character c of rank i in BWT is the same edge as 
the outgoing edge encoded by the 1-bit of rank C[c] + i in bit vector M. 

Theorem 2. The class of languages recognized by prefix-range-sorted automata is strictly between 
finite languages and regular languages. 

Proof. Consider the regular infinite language {jfx$ \ x G {a, &}*}. The minimal automaton recog- 
nizing this language is prefix-range-sorted, as each node has a distinct label. 

Now consider the regular language L = {jfx$ \ x G {a, b}* U {a,c}*}. Assume that there is a 
prefix-range-sorted automaton recognizing language L. Suffixes B n = a n b% and C n = a n c% must be 
recognized from different nodes, as bB n is a suffix of language L, while bC n is not. Suffixes B n and 
B n+ \ must also be recognized from different nodes, as B n+ \ < C n+ \ < B n . As the automaton must 
have an infinite number of nodes, it cannot be prefix-range-sorted. □ 
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Fig. 3. A reverse deterministic automaton corresponding to the first 10 positions of the multiple alignment in Figure 1. 



ACTA CTA ACTG 




Fig. 4. A prefix-sorted automaton built for the automaton in Figure 3. The strings above nodes are prefixes p(v). 
Table 1. GCSA for the automaton in Figure 4. Nodes are identified by prefixes p(v). 

$ ACC ACG ACTA ACTG AG AT CC CG CTA CTG G$ GA GT TA TG$ TGT # 

BWT G T G G T T G A A A AC AT #— CT CG- C A $ 
F 1 1 1 1 1 1 1 1 1 1 10 10 100 10 100 1 1 1 
M 11 1 1 1 1 1 1 1 1 10 10 111 10 111 1 11 



5 Index construction 

Our construction algorithm is related to the prefix- doubling approach to suffix array construction 
[22]. We start with a reverse deterministic automaton (see Figure 3), convert it to an equivalent 
prefix-range-sorted automaton (Figure 4), and build the GCSA (Table 1) for that automaton. 
The algorithm consists mostly of sorting, scanning, and database joins. Hence it can be efficiently 
implemented in parallel, distributed, and external memory settings. 

Theorem 3. Assume we have a length n multiple alignment of r sequences from alphabet of size 
a. We can build a prefix-range-sorted automaton recognizing all paths through the alignment in 
0(nrlogr + \V'\ log|V| logn + \E'\) time and 0{nr\oga + \V'\ log| V r | + \E'\ log|£"|) bits of space, 
where V' and E' are the largest intermediate sets of nodes and edges, respectively. 

Proof. From Lemmas 1, 2, and 3 below. □ 

The sizes of the largest intermediate sets of nodes and edges are analyzed in a restricted model 
in the Appendix. 

5.1 Building the automaton 

In this section, we show how to construct a reverse deterministic automaton from a multiple align- 
ment of sequences. With a similar approach, we can also construct the automaton from a reference 
sequence and a set of SNPs. 

To efficiently build a reverse deterministic automaton, we have to modify the sequences. Let 
Si, . . . , S r be sequences of length n from alphabet UU{— }, where — represents a gap in the sequence. 
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Consider position j. For every sequence Si, let Si\j— kj\ = Ci be the first non-gap character preceding 
We say that sequences Si and Si> are equivalent at position j, if Si[j] = Si>[j] ^ — and a = cy . 
If sequences Si and SV are equivalent at position j, we move the preceding characters to position 
max(j - ki,j - ki>). 

These modifications can be done in one pass over the sequences in reverse direction. Then, if 
there are positions with a gap in every sequence, we remove these positions from the sequences. 
Finally, we form new sequences Tj = #<Si$ for all i. 

Claim. Let j be a position such that Ti[j] = Tj/[j] for some i 7^ i! , and let Tj[j — k] and Ti>[j — k'] 
be the preceding non-gap characters. If Ti[j — k] = Ti'[j — k'], then k = k' . 

Let m > be the desired context length. 3 For every sequence i and position j, where Tj[j] 7^ — , 
we define a label £(i,j) of length m + 1 consisting of the character Tj[j] and the next m non-gap 
characters (context). If there are not enough characters left, we add duplicates of $ to the end of 
the label. 

For every non-gap character Tj[j], we create a temporary node Vij with label £(vij) = Tj[j] and 
an edge (vij_k, v i,j), where Tj[j — k] is the non-gap character preceding Tj[j]. Then, for all positions 
j > 1 and all sequences Tj / Tj/, we merge nodes Vij and ivj, if £{i,j) = £(i',j), to get the actual 
nodes. For position j = 1, we merge nodes i^i for all sequences i. 

Claim. Let Vjj) and (w ^ ,j-fe', be two edges with £(vij_k) = £{vi',j-k')- Then fc = fc', 

— k) = £(i',j — k') = £(vij-k)£(i, j)[l,m], and hence nodes fij-fe and ivj-fc' will be merged, 
making the automaton reverse deterministic. 

Lemma 1. Lei re be the length of the multiple alignment, r the number of sequences, a the size 
of the alphabet, and m the context length. Building a reverse deterministic automaton A = (V, E) 
takes O(nrlogr) time and requires 0{nr\oga + \E\ \og\E\) bits of space. 

Proof. The alignment modifications can be done in 0(r log r) time per position by sorting the 
pairs (ci,ki) and checking, if there are multiple values of k per character. To create the nodes, we 
have to sort the labels £(i,j) for each position j. If we scan the alignment in reverse direction, we 
can maintain this order in 0(r) time per position. Creating the edges also takes 0(r) time per 
position, as at most r edges are created. Space requirements come from storing the alignment and 
the automaton. □ 

5.2 Creating the nodes of a prefix-sorted automaton 

Definition 3. Let A be a finite automaton recognizing a finite language, and let k > be an integer. 
Automaton A is /c-sorted if, for every node v G V, the labels of all paths from v to V\ v \ share a 
common prefix p(v, k) of length k, or if node v is prefix-sorted by prefix p(v, k) of length at most k. 

Note that every automaton is 1-sorted. Automaton A is prefix-sorted if and only if it is n-sorted, 
where n is the length of the longest string in L(A). 

Starting from a reverse deterministic automaton A = Aq, we create the nodes of automata 
= (Vi,Ei) for i = 1,2,... that are 2*-sorted, until we get an automaton that is prefix-sorted. 
For every node v 6 V*, let P(v) be the path of A corresponding to prefix p(v,2 l ). We store the 
first and the last nodes of this path as from(v) and to(v), and set rank(v) to be the lexicographic 
rank of prefix p(v,2 l ) among all distinct prefixes p(u, 2 l ) of nodes u G V%. Value sorted(v) is used 
to indicate whether the node is prefix-sorted. 

3 Two mutations within m positions in the same sequence are considered to be parts of the same mutation. Increasing 
context length generally decreases index size, construction time, and construction space. 
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Claim. Node v is prefix-sorted if and only if rank(v) is unique. 

The basic step of the algorithm is the doubling step from A± to A+i- If node u G V{ is prefix- 
sorted, we duplicate it as u> G V$+i, and set rank{w) = (rank(u), 0). Otherwise we create a joined 
node ut> G Vj+i for every node v £ Vi such that P(uv) = P(u)P(v) is a path in A, and set 
£(uv) = £(u) and rank{uv) = (rank(u),rank(v)). As path P(uv) exists if and only if there is an 
edge (to(u),from(v)) G £b, this essentially requires two database joins. When the nodes of Ai + \ 
have been created, we sort them by their ranks, and replace the pairs of integers with integer ranks. 

The doubling step is followed by the pruning step, where we merge redundant nodes. The nodes 
in Vi+i are sorted by their rank(-) values. If all nodes sharing a certain rank(-) value also share 
their from(-) node, these nodes are equivalent, and can be merged. 

Claim. After doubling and pruning steps, automaton Ai + i is 2 i+1 -sorted, and recognizes language 
L{A t ). 

Lemma 2. Creating the nodes of prefix-sorted automaton A\ takes 0(|V'| log|V'| logn) time and 
requires 0{\V'\ log|V'|) bits of space in addition to automaton A, where V' is the largest set of nodes 
during construction, and n is the length of the longest string in L(A). 

Proof. We can implement the doubling and pruning steps by scanning and sorting the nodes several 
times. Hence each step requires at most 0(|V'| log| V'|) time. As we need at most [logn] doubling 
steps to get a prefix-sorted automaton, the time bound follows. The space bound of 0(|V'| log|V'|) 
bits is the space required to store \ V'\ nodes. □ 

5.3 Creating the edges 

Let A = (V, E) be a reverse deterministic automaton recognizing a finite language, and let W 
be the set of nodes of an equivalent prefix-sorted automaton. To create the edges, we first merge 
nodes with adjacent rank(-) values, if they share their from(-) node. The resulting set V is the set 
of nodes of a prefix-range-sorted automaton A' = (V',E') equivalent to automaton A. The set of 
edges E' can be constructed efficiently from automaton A and the set of nodes V' . 

Claim. For every node v G V' , we have {from(u) \ (u,v) G E'} = {u \ (u, from{v)) G E}. Further- 
more, there are no edges (u, v), (u',v) G E' such that from(u) = from(u'). 

We generate the incoming edges (u, v) G E' as a list of pairs (from(u) , v) , sorted by (£(from(u)),rank(v)). 

Claim. Edges sorted by (£(from(u)),rank(v)) are also sorted by rank(u). 

We replace each pair (from(u),v) with edge (u,v) by scanning the sorted lists of nodes and 
edges. As every node (except the final node) has at least one outgoing edge, and no adjacent nodes 
share their from(-) value, all adjacent edges with the same from(-) value start from the current 
node. When the from(-) value changes, we advance to the next node in the list. 

Lemma 3. Creating the edge of prefix-range- sorted automaton A' takes 0(\W\ + \E'\) time and 
requires 0(\W\ log| W| + \E'\ log^'l) bits of space. 

Proof. Assuming that the set of nodes W is already sorted, merging adjacent nodes takes 0(|W|) 
time. Creating the sorted list of pairs (from(u),v) takes 0(|£"|) time, as we can output the pairs 
into a buckets according to £(from(u)), and the nodes are already sorted by rank(v). Replacing 
the pairs with edges takes 0(\E'\) time. Space complexity comes from storing A, W, and E'. □ 
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Fig. 5. Mauve output for the four different versions of chromosome 18. Regions conserved in all versions are drawn 
with mauve color (the most prominent color). The height of the color profile corresponds to the average level of 
conservation. White areas could not be aligned and are probably specific to particular genome (or long runs of N). 



6 Implementation and Experiments 

We have implemented GCSA in C++, using the components from on our implementation of RLCSA 
[17]. 4 For each character c G U U {#}, we use a gap encoded bit vector to mark the occurrences 
of c in BWT. Bit vectors F and M are run-length encoded, as they usually consist of long runs of 
1-bits. Bit vector B is gap encoded, while the samples are stored using [log(zd max + 1)] bits each, 
where id max is the largest sampled value. Block size was set to 32 bytes in all bit vectors. 

The implementation was compiled on g++ version 4.3.3. We used a system with 32 gigabytes 
of memory and two quad-core 2.53 GHz Intel Xeon E5540 processors running Ubuntu 10.04 with 
Linux kernel 2.6.32 for our experiments. Only one core was used in all experiments, except that 
GCSA construction used the parallel sorting algorithm provided by the libstdc++ parallel mode. 

We built a multiple alignment for four different assemblies of the human chromosome 18 (about 
76 Mbp each). Three of the assemblies were from NCBI 5 : the assemblies by the Genome Reference 
Consortium (GRChST), Celera Genomics (Celera), and J. Craig Venter Institute (HuRef). The 
fourth sequence 6 was from Beijing Genomics Institute (YanHuang). The sequences were aligned 
by the Mauve Multiple Genome Alignment software [3]. We ran progressiveMauve 2.3.1 assuming 
collinear genomes, as we do not support rearrangements. The multiple alignment took a few hours 
to build: about 89.4% of nucleotides aligned perfectly, 0.19% with one or more mismatches, and 
10.4% were inside of a gap. The number of gaps was high mainly because of the differences in the 
centromere region. Fig. 5 gives a rough visualization of the similarity between sequences. 

For our experiments, we constructed a GCSA (sample rate 16) for the alignment with various 
context lengths m. We also built RLCSA (sample rate 32) and BWA 0.5.8a [13] for the four se- 
quences. We searched for exact matches of 10 million Illumina/Solexa reads of length 56, sequenced 
from the whole genome, as both regular patterns and reverse complements. Table 2 lists the results 
of these experiments. As there were relatively few occurrences inside the selected chromosome, 
most of the time was spent doing find. Hence the sample rate that only affects locate had little 
effect on the overall performance. GCSA was 2.5-3 times slower than RLCSA and 3.5-4 times 
slower than BWA. About 1% of the reads matched by GCSA were not matched by the other in- 
dexes. Construction requirements for GCSA were higher than for the other indexes (see Sect. 7 for 
discussion). 

4 http: //www. cs . helsinki . f i/group/ suds/gcsa/ 

5 f tp : //f tp . ncbi . nih . gov/genomes/H_sapiens/ Assembled_chromosomes/ 

6 f tp : //public .genomics . org. cn/BGI/yanlmang/f a/ 
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Table 2. Index construction and exact matching with GCSA (sample rate 16), RLCSA (sample rate 32), and BWA 
on a multiple alignment of four sequences of human chromosome 18. Times for locate include the time used by find. 
GCSA-m denotes GCSA with context length m. 







Construction 


Matching 




Index 


Size 


Time 


Space 


Matches Find 


Locate 


GCSA-2 


69.3 MB 


10 min 


7.0 GB 


388,873 16 min 


20 min 


GCSA-4 


67.2 MB 


10 min 


6.7 GB 


388,212 16 min 


18 min 


GCSA-8 


64.7 MB 


9 min 


4.8 GB 


387,707 16 min 


18 min 


RLCSA 


165.0 MB 


5 min 


2.3 GB 


384,400 6 min 


7 min 


BWA 


212.4 MB 


4 min 


1.4 GB 


384,400 


5 min 



Table 3. Approximate matching with GCSA and RLCSA. The reported matches for given edit distance k include 
those found with smaller edit distances. 





GCSA-4 


RLCSA 


k 


Matches 


Time 


Matches Time 





388,212 


18 min 


384,400 7 min 


1 


620,263 


101 min 


609,320 39 min 


2 


876,228 


283 min 


856,373 111 min 


3 


1,146,032 


1,730 min 


1,118,719 721 min 



The performance gap between GCSA and RLCSA reflects differences in fundamental techniques, 
as the implementations share most of their basic components and design choices. Theoretically 
GCSA should be about 3 times slower, as it requires six bit vector operations per base in find, 
while RLCSA uses just two. The differences between RLCSA and BWA come from implementation 
choices, as RLCSA is intended for highly repetitive sequences and BWA for fast pattern matching 
with DNA sequences. 

To test GCSA in a more realistic alignment algorithm, we implemented BWA-like approximate 
searching [13] for both GCSA and RLCSA. There are some differences to BWA: i) we return all 
best matches; ii) we do not use a seed sequence; iii) we have no limits on gaps; and iv) we have to 
match 0(\P\ log \P\) instead of 0(|P|) characters to build the lower bound array for pattern P, as 
we have not indexed the reverse sequence. We used context length 4 for GCSA, as it had the best 
trade-off in exact matching. 

The results can be seen in Table 3. GCSA was about 2.5 times slower than RLCSA, while finding 
from 1.0% (exact matching) to 2.4% (edit distance 3) more matches in addition to those found by 
RLCSA. BWA is significantly faster (e.g. finding 1,109,668 matches with k = 3 in 40 minutes), 
as it solves a slighly different problem, ignoring a large part of the search space with biologically 
implausible edit operations. A fair comparison with BWA is currently impossible without significant 
amount of reverse engineering. With the same algorithm in all three indexes, the performance 
differences should be similar as in exact matching. 

We also tried to configure BWA to use a more similar algorithm to ours. With no limitations 
on gaps, BWA found matches for 1,156,013 reads in 257 minutes with edit distance 3, while the 
actual edit distance grew past 3 in some cases. As the exact mechanism BWA uses to handle gaps 
is unknown, we could not implement it for GCSA and RLCSA. 

Finally, we made a preliminary experiment on the SNP calling application mentioned in Sect. 1 
using in-house software. We called for SNPs from chromosome 18 with minimum coverage 2, using 
all 10 million reads, as well as only those reads with no exact matches on GCSA-4. The number 
of called SNPs was 4203 with all reads and 1074 with non-matching ones. We did not yet compare 
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how much of the reduction can be explained by exact matches on recombinants that would also be 
found using approximate search on one reference, and how much by more accurate alignment due 
to richer reference set. 

7 Discussion 

Based on our experiments, GCSA is 2.5-3 times slower than a similar implementation of CSA used 
in the same algorithm. With typical mutation rates, the index is also not much larger than a CSA 
built just for the reference sequence. Hence GCSA does not require significantly more resources 
than a regular compressed suffix array, while providing biologically relevant extended functionality. 

While our construction algorithm uses more resources than CSA construction, genomes of up to 
about 100 Mbp can be indexed on a single workstation in a reasonable time. An external memory 
implementation should allow us to build an index for the human genome and all known SNPs in less 
than a day. Extrapolating from current results, the final index should be 2.5-3 gigabytes in size. As 
a faster alternative for indexing large genomes, we are also working on a distributed construction 
algorithm in the Map Reduce framework [4]. 

To improve the running time of short read alignment and related tasks, most of the search 
space pruning mechanisms (in addition to the one mechanism we already used from [13]) to support 
approximate matching on top of BWT [11, 13, 15, 18] can be easily plugged in. Local alignment [9, 
14] can also be supported. 

As mentioned in Section 1, an obvious generalization is to index labeled weighted graphs, where 
the weights correspond to probabilities for jumping from one sequence to another in the alignment. 
This does not increase space usage significantly, as the probabilities differ from 1.0 only in nodes 
with multiple outgoing edges. In the restricted model analyzed in the Appendix, the extra space 
requirement is 0(pn log n) bits for a reference sequence of length n and mutation rate p. During the 
construction of the index, it is also easy to discard paths with small probabilities, given a threshold. 
This approach can be used e.g. to index recombinants only in the recombination hotspot areas [20]. 

The experiments conducted here aimed at demonstrating the feasibility and potential of the 
approach. Once we have the genome-scale implementation ready, we are able to test our claims 
on improving variant calling and primer design accuracy with the index. Both require wet-lab 
verification to see the true effect on reducing false positives. 
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Appendix: Expected case analysis 

In the following, all random choices are independent and identically distributed (IID). 

We analyze the size of the automata created by the doubling algorithm in the following model. 
Let S[l, n] be a reference sequence, and let p be the mutation rate. For each position i = 1, . . . , n, 
the initial automaton A has a node Ui with label l(ui) = S[i], randomly chosen from alphabet U. 
With probability p, there is also another node wi with a random label £(wi) G U \ {S[i]}. The 
automaton has edges from all nodes at position % to all nodes at position i + 1 for all i. 

Definition 4. Let k > be an integer. A /c-path in an automaton is a path of length k, or a 
shorter path ending at the final node. 

Let k > be an integer. For any position i, let X ijk be the number /c-paths starting from position 
i in the reference sequence. If there are j mutated positions covered by these paths, then X^k = 2 J , 
and each of the paths has a different label. The number of mutations is binomially distributed, 
with the path length and the mutation probability as the parameters. From the moment-generating 
function for binomial distribution, we get 

k 

E [X hk ] = ^Pr(X l)fe = j)2? < (l+p) k . (1) 

3=0 

For positions i = 1, . . . ,n — k + 1, this is an equality. 

Lemma 4. Let Ah be a 2 h -sorted automaton equivalent to the original automaton A. Then N(2 h ) = 
n(l + p) 2h + 2 is an upper bound for the expected number of nodes in A^. 

Proof. For every 2 fe -path starting from a position i in the reference sequence, there is at most one 
node in automaton Ah- On the other hand, every node in the automaton, except for the initial 
and the final nodes, corresponds to a path that can be extended to some such 2^-path. Hence the 
total number of nodes is at most Y17=i ^i,k + 2. By Eq. 1, the expected number of nodes is at most 
N(2 h ). ' □ 

Lemma 5. Let A^ be the 2 h -sorted automaton built from automaton A. Then N(2 h )(l + p) is an 
upper bound for the expected number of edges in Ah . 

Proof. The indegree of the initial node of Ah is 0. For every other node v, let pos(v) be the 
position of the reference sequence corresponding to from(v). If from(v) is the final node of A, then 
pos{v) = n + 1. If there is no mutation at position pos(v) — 1, then in(v) = 1. Otherwise in(v) = 2. 
Hence the expected number of edges is at most (1 + p) times the number of nodes. □ 

Consider the expectation E [Xj^Xj/^] for a pair of text positions i < i'. If i' > i + k, then the 
random variables are independent, and the expectation becomes 

E [X itk Xi, tk ] = E [X itk ] E [X Vtk ] < (1 + P? k - (2) 

Otherwise assume that the paths starting from positions i and i' overlap in k' < k positions. Then 
the expectation is a product of the expectations of three independent random variables X ijk _ k /, 
Xf, k ,, and Xi/ +k > ik - k >. By using the moment-generating function, we get 

E [X i7k Xi,, k ] < (l+p) 2 ^ k '\l + 3pf < (l+pf k . (3) 
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Definition 5. A pair of nodes of automaton Ah collides, if the corresponding 2 h -paths have iden- 
tical labels. 

Lemma 6. Let Ah be the 2 h -sorted automaton built from automaton A by using the doubling al- 
gorithm. The expected number of colliding pairs of nodes in automaton Ah is at most C(2 h ) = 
n 2 (l+p) 3 - 2 V<7 2 \ 



Proof. If two paths start from the same position in the reference sequence, the corresponding nodes 
cannot collide. As the colliding paths must be of length 2 h (otherwise the nodes would be prefix- 
sorted), the probability of collision of any given pair is a~ 2h . By Equations 2 and 3, the expected 
number of colliding pairs is at most 



EEjW^] <n 2 (l+pf 2 V 



Ki' 

The lemma follows. □ 

Theorem 4. Let n be the length of the reference sequence, a the size of the alphabet, and p < 
cr 1 / 3 — 1 the mutation rate. For any e > 0, the largest automaton created by the doubling algorithm 
has at most n(l + p) k + 2 nodes with probability 1 — e, where k = 21og CT ^r/(l — 31og CT (l + p)). 

Proof. We want to find k = 2 h , for an integer h, such that the expected number of colliding pairs 
in automaton Ah is at most e. Then, by Markov's inequality, the probability of having a colliding 
pair is at most e. If there are no colliding pairs, then the automaton is prefix-sorted. By Lemma 4, 
if this happens after h doubling and pruning phases, the expected number of nodes in the largest 
automaton created is at most N(k) = n(l + p) k + 2. 

By using the bound for the expected number of colliding pairs from Lemma 6, we get 

n 2 (l + v) 2,k log — 

C ^ = \ ^ e i , i h a. s ^ k - 

a k 1 - 31og CT (l +p) 

2 

As k has to be a power of two, 21og CT ^/(l — 31og CT (l + p)) is an upper bound for the smallest 
suitable k. □ 

Corollary 1. For a random reference sequence of length n and mutation ratep < 0.08, the expected 
number of edges in the largest automaton is at most n(l + p^°( lo s n ) _|_ 0(1). 

Proof. For p < 0.08, the k in Theorem 4 is at most 31og^r. By selecting e = (^)\ we get node 
bound n(l +p) 3 '( 2 +*) lo s n + 2 with probability 1 — e. Hence the expected number of nodes is at most 

n{\ + p) 91ogn ( ) + 2 < n(l + p)°^ n ) + 2. 

i=o ^ n ' 

By Lemma 5, the expected number of edges is at most (1 + p) times that. □ 
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